## 'data.frame': 3380 obs. of 5 variables:
## $ state : chr "WA" "CT" "FL" "OH" ...
## $ potency: num 77 51 68 69 75 73 54 58 77 49 ...
## $ weight : num 217 248 43 123 118 127 50 140 127 74 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ price : num 5000 4800 3500 3500 3400 3000 3000 2800 2600 2600 ...
## state potency weight month
## Length:3380 Min. : 2.00 Min. : 1.00 Min. : 1.000
## Class :character 1st Qu.:49.00 1st Qu.: 3.00 1st Qu.: 3.000
## Mode :character Median :63.00 Median : 11.00 Median : 7.000
## Mean :62.05 Mean : 23.72 Mean : 6.414
## 3rd Qu.:76.00 3rd Qu.: 29.00 3rd Qu.: 9.000
## Max. :98.00 Max. :505.00 Max. :12.000
## price
## Min. : 10.0
## 1st Qu.: 200.0
## Median : 500.0
## Mean : 814.3
## 3rd Qu.:1100.0
## Max. :9000.0
Here we consider only the seizures with a weight below 200 grams. Define a new data frame cocaine2 accordingly. Hint: You can use the command subset() to select only part of the data set.
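A minimal sketch of the filtering step with subset(). The data frame toy_cocaine is hypothetical: it is a small made-up stand-in with the same column names as the real data, so that the example runs on its own.

```r
# Toy stand-in for the seizure data; all values are invented for illustration.
toy_cocaine <- data.frame(
  state   = c("WA", "CT", "FL", "OH", "FL"),
  potency = c(77, 51, 68, 69, 75),
  weight  = c(217, 248, 43, 123, 118),
  price   = c(5000, 4800, 3500, 3500, 3400),
  stringsAsFactors = FALSE
)

# Keep only the seizures strictly below 200 grams.
toy_cocaine2 <- subset(toy_cocaine, weight < 200)

# Number of remaining rows.
nrow(toy_cocaine2)
```

On the real data, the same call would presumably be applied to the full data frame instead of toy_cocaine.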
In which three states do seizures happen most frequently? Investigate this graphically. Hint: Create a barplot of the variable state.
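One possible way to find the most frequent states is to count the seizures per state with table() and draw the counts with barplot(). The vector states below is invented for illustration and does not reproduce the real frequencies.

```r
# Invented state labels, used only to show the counting logic.
states <- c("FL", "NY", "VA", "FL", "NY", "FL", "WA", "VA", "NY", "FL")

# Count seizures per state and sort from most to least frequent.
state_counts <- sort(table(states), decreasing = TRUE)

# Names of the three most frequent states.
top3 <- names(state_counts)[1:3]

# Bar heights are the counts per state; drawn only in an interactive session.
if (interactive()) {
  barplot(state_counts, las = 2, ylab = "Number of seizures")
}
```

Sorting the table first makes the tallest bars appear on the left, which makes the top three easy to read off.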
# Florida, New York, Virginia
# Yes, seizures of smaller weight clearly tend to happen more frequently.
# Price in relation to weight:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
# Price in relation to potency:
# Not log transformed
# Log transformed
# Potency against weight (potency on x axis)
# Potency against weight (potency on y axis)
# Again, the graphs show no apparent correlation between potency and price, and weight does not appear to affect potency.
The aim of this exercise is to check whether there is a relation between the average arrival delay and the departure time of planes. Load the package nycflights13, which contains the on-time data set flights, using the command require(nycflights13). The flights data set covers all flights departing from one of the New York airports in 2013.
In particular, the interest lies in the following variables:
• hour, minute: the hour and minute of the departure
• arr_delay: the arrival delay of the incoming plane (in minutes)
• dest: the destination.
Use the following function call in order to calculate the average delay per value of time: aggregate(formula = arr_delay ~ time, data = flights, FUN = …, na.rm = …)
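The aggregation can be sketched on a small made-up data frame. Two things here are assumptions rather than part of the exercise text: toy_flights is hypothetical, and time is built as decimal hours from hour and minute. FUN = mean with na.rm = TRUE follows the hint given for aggregate() in the next exercise. The formula is passed as an unnamed first argument, since the name of that argument has changed between R versions.

```r
# Hypothetical miniature of the flights data; values are invented.
toy_flights <- data.frame(
  hour      = c(6, 6, 6, 17, 17),
  minute    = c(0, 0, 30, 15, 15),
  arr_delay = c(-5, 15, NA, 40, 20)
)

# Assumption: represent the departure time as decimal hours.
toy_flights$time <- toy_flights$hour + toy_flights$minute / 60

# Average arrival delay per value of time.
# The formula method drops rows with a missing arr_delay by default
# (na.action = na.omit), so the 6:30 row does not appear in the result.
delay_per_time <- aggregate(arr_delay ~ time, data = toy_flights,
                            FUN = mean, na.rm = TRUE)
delay_per_time
```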
The plot shows a gap in the data points between midnight and 5am. This makes sense, as most flights depart around 8am and in the late afternoon or early evening. The histogram confirms this. Most delays happen before 10am, especially around 8am.
Values were divided by 100 so that the scaled points can be seen more clearly.
# The second plot is more informative because it shows flight count information in addition to delays.
The goal is to explore whether there are large differences between destinations regarding arrival delay and number of flights. We work again with the flights data set in the package nycflights13 from Exercise 2. If you need to reload the data set, use the command require(nycflights13).
Calculate the average value of the arrival delay arr_delay for each destination (dest). Omit all the missing values in the calculation. Hint: Use the function aggregate(). The argument na.rm = TRUE of the function mean() omits missing values in the calculation of the mean. Note that the function aggregate() creates a data frame whose first column corresponds to the grouping variable (here dest). Save the output of the function aggregate() as a new data frame delay.per.dest.
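A minimal sketch of this step on invented data. The data frame toy_flights is hypothetical and only mimics the two relevant columns of flights.

```r
# Hypothetical miniature of the flights data; values are invented.
toy_flights <- data.frame(
  dest      = c("ATL", "ATL", "BOS", "BOS", "BOS", "ORD"),
  arr_delay = c(10, NA, -4, 6, 1, 25),
  stringsAsFactors = FALSE
)

# Mean arrival delay per destination, ignoring missing values.
# The first column of the result is the grouping variable dest.
delay.per.dest <- aggregate(arr_delay ~ dest, data = toy_flights,
                            FUN = mean, na.rm = TRUE)
delay.per.dest
```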
Calculate the number of planes departing to each destination. Add those counts as variable n to the data frame delay.per.dest. Hint: Use again aggregate() but only save the second column of the output.
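The counting step can be sketched in the same way, again on the hypothetical toy_flights. One detail is a design choice of this sketch, not something the exercise prescribes: na.action = na.pass is set so that flights with a missing arr_delay are still counted, and match() aligns the counts by destination instead of relying on row order.

```r
# Hypothetical miniature of the flights data; values are invented.
toy_flights <- data.frame(
  dest      = c("ATL", "ATL", "BOS", "BOS", "BOS", "ORD"),
  arr_delay = c(10, NA, -4, 6, 1, 25),
  stringsAsFactors = FALSE
)

# Mean arrival delay per destination, as in the previous step.
delay.per.dest <- aggregate(arr_delay ~ dest, data = toy_flights,
                            FUN = mean, na.rm = TRUE)

# Number of flights per destination. na.pass keeps rows with a missing
# arr_delay, so length() counts every flight.
counts <- aggregate(arr_delay ~ dest, data = toy_flights,
                    FUN = length, na.action = na.pass)

# Save only the second column (the counts) as variable n,
# matched on dest to guard against differing row order.
delay.per.dest$n <- counts[match(delay.per.dest$dest, counts$dest), 2]
delay.per.dest
```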
Merge the data frames delay.per.dest and airports in order to add the coordinates (lon, lat) of the airports to delay.per.dest. The data frame airports is included in the package nycflights13. Hint: Use the function merge(x = delay.per.dest, y = …, by.x = "dest", by.y = "faa", all.x = TRUE, all.y = FALSE). Look at the help file of the function merge() by typing ?merge to understand what the different arguments mean.
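The effect of all.x and all.y can be seen on a small example. Both data frames below are hypothetical, with invented values and only approximate coordinates; toy_airports stands in for airports, and "XXX" is a made-up code with no match.

```r
# Hypothetical per-destination summary; "XXX" has no matching airport.
delay.per.dest <- data.frame(
  dest      = c("ATL", "BOS", "XXX"),
  arr_delay = c(10, 1, 25),
  n         = c(2, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for airports; "SEA" is not a destination above.
toy_airports <- data.frame(
  faa = c("ATL", "BOS", "SEA"),
  lon = c(-84.4, -71.0, -122.3),
  lat = c(33.6, 42.4, 47.4),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every destination, even without coordinates;
# all.y = FALSE drops airports that are not a destination.
merged <- merge(x = delay.per.dest, y = toy_airports,
                by.x = "dest", by.y = "faa",
                all.x = TRUE, all.y = FALSE)
merged
```

Destinations without a match keep their row but get NA for lon and lat. This may be what lies behind a warning about removed rows when such a data frame is plotted.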
Create a scatterplot of the latitude against the longitude and scale the points according to the number of departing planes. Hint: Use the argument size = … in the function aes().
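A sketch of the plot call, assuming the package ggplot2 is installed. The data frame toy_map is hypothetical and uses invented values. The plot object is only built here; calling print(p) would draw it.

```r
# Hypothetical merged data: one row per destination with coordinates.
toy_map <- data.frame(
  dest = c("ATL", "BOS", "ORD"),
  lon  = c(-84.4, -71.0, -87.9),
  lat  = c(33.6, 42.4, 42.0),
  n    = c(200, 300, 100)
)

# Build the plot only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Latitude against longitude; point size mapped to the number of flights.
  p <- ggplot(toy_map, aes(x = lon, y = lat, size = n)) +
    geom_point()
}
```

Mapping size inside aes() ties the point size to the data, whereas setting size outside aes() would give every point the same fixed size.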
## Warning: Removed 4 rows containing missing values (geom_point).
e) Moreover, color the points by the value of the average arrival delay. What do you notice?
# Larger airports seem to have delays closer to zero, whereas smaller airports tend to have either a high occurrence of delays or a high occurrence of flights leaving early.
The Gapminder Foundation wants to give access to a fact-based world view in order to promote sustainable global development. For more information and entertaining videos, see http://www.gapminder.org/. The aim of this exercise is to obtain a nice visualization of life expectancy vs. GDP per capita. This is achieved by successively adding more functions to a basic function call. Load the package gapminder, which contains the data set gapminder and the vector country_colors, using the command require(gapminder, quietly = TRUE). Take a first look at the data by reading the help files with the commands ?gapminder and ?country_colors and by using str(). Consider first the data set gapminder.